End-to-End Machine Learning with H2O DSC5.0 Tutorial, Part 1 (bit.ly/dsc50_h2o_tutorial)
1 Agenda
- 16:15 to 17:00 Set Up & Introduction
- 17:00 to 17:45 Regression Example
- 17:45 to 18:00 Coffee Break
- 18:00 to 18:45 Classification Example
- 18:45 to 19:15 Bring Your Own Data + Q&A
2 Set Up
2.1 Download -> bit.ly/dsc50_h2o_tutorial
- scripts/setup.R: install required packages
- rmd/tutorial_pt1.Rmd: RMarkdown file with introduction and regression code
- rmd/tutorial_pt2.Rmd: RMarkdown file with classification code
- scripts/tutorial_pt1.R: R script file with introduction and regression code
- scripts/tutorial_pt2.R: R script file with classification code
- tutorial.html: this webpage
- Full URL https://github.com/woobe/useR2019_h2o_tutorial (if bit.ly doesn’t work)
2.2 R Packages
- Check out setup.R
- For this tutorial:
  - h2o for machine learning
  - mlbench for the Boston Housing dataset
  - DALEX, iBreakDown, ingredients & pdp for explaining model predictions
- For RMarkdown:
  - knitr for rendering this RMarkdown
  - rmdformats for the readthedown RMarkdown template
  - DT for nice tables
3 Introduction
This is a hands-on tutorial for R beginners. It demonstrates how to use H2O and other R packages for end-to-end, automatic and interpretable machine learning. Participants will be able to follow along and quickly build regression and classification models with the H2O library, and to explain the model outcomes with various methods.
It is a workshop for R beginners and anyone interested in machine learning. The RMarkdown and the rendered HTML will be provided so everyone can follow along without running the code.
4 Regression Part One: H2O AutoML
4.1 Data - Boston Housing from mlbench
data("BostonHousing")
datatable(head(BostonHousing),
          rownames = FALSE, options = list(pageLength = 6, scrollX = TRUE))

Source: UCI Machine Learning Repository
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- b: 1000(Bk - 0.63)^2 where Bk is the proportion of people of African American descent by town.
- lstat: lower status of the population (percent).
- medv (This is the TARGET): median value of owner-occupied homes in $1000s.
4.2 Define Target and Features
target <- "medv" # Median House Value
features <- setdiff(colnames(BostonHousing), target)
print(features)

 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"
 [8] "dis"     "rad"     "tax"     "ptratio" "b"       "lstat"
4.3 Start a local H2O Cluster (JVM)
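The connection summary below is produced by h2o.init(). A minimal sketch of this step (the seed variable n_seed used throughout the code is an assumption; any fixed integer works):

```r
library(h2o)

# Start (or connect to) a local H2O cluster (JVM); nthreads = -1 uses all cores
h2o.init(nthreads = -1, max_mem_size = "4G")

# Seed used throughout the tutorial (assumed value)
n_seed <- 12345
```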
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 1 hours 26 minutes
H2O cluster timezone: Europe/Belgrade
H2O data parsing timezone: UTC
H2O cluster version: 3.26.0.2
H2O cluster version age: 3 months and 20 days !!!
H2O cluster name: H2O_started_from_R_branko_gkt443
H2O cluster total nodes: 1
H2O cluster total memory: 3.73 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.5.2 (2018-12-20)
4.4 Convert R dataframe into H2O dataframe
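No code is shown for this step; the conversion is a one-liner with as.h2o(), assuming the cluster from 4.3 is running and BostonHousing is loaded:

```r
# Push the R dataframe into the H2O cluster as an H2OFrame
h_boston <- as.h2o(BostonHousing)
```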
4.5 Split Data into Train/Test
h_split <- h2o.splitFrame(h_boston, ratios = 0.8, seed = n_seed)
h_train <- h_split[[1]] # 80% for modelling
h_test <- h_split[[2]]  # 20% for evaluation

[1] 411 14
[1] 95 14
4.6 Cross-Validation
4.7 Baseline Models
- h2o.glm(): H2O Generalized Linear Model
- h2o.randomForest(): H2O Random Forest Model
- h2o.gbm(): H2O Gradient Boosting Model
- h2o.deeplearning(): H2O Deep Neural Network Model
- h2o.xgboost(): H2O wrapper for the eXtreme Gradient Boosting Model from DMLC
4.7.1 Baseline Generalized Linear Model (GLM)
model_glm <- h2o.glm(x = features, # All 13 features
y = target, # medv (median value of owner-occupied homes in $1000s)
training_frame = h_train, # H2O dataframe with training data
model_id = "baseline_glm", # Give the model a name
nfolds = 5, # Using 5-fold CV
seed = n_seed) # Your lucky seed

H2ORegressionMetrics: glm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 23.04256
RMSE: 4.800267
MAE: 3.307191
RMSLE: NaN
Mean Residual Deviance : 23.04256
R^2 : 0.7076243
Null Deviance :32617.3
Null D.o.F. :410
Residual Deviance :9470.494
Residual D.o.F. :396
AIC :2487.815
H2ORegressionMetrics: glm
MSE: 28.87315
RMSE: 5.373374
MAE: 3.859189
RMSLE: 0.1861469
Mean Residual Deviance : 28.87315
R^2 : 0.7254239
Null Deviance :10402.21
Null D.o.F. :94
Residual Deviance :2742.949
Residual D.o.F. :80
AIC :621.075
Let’s use RMSE
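RMSE is the square root of the mean squared error, so it is in the same units as the target (here, $1000s) and penalizes large errors more heavily. A toy illustration with made-up values:

```r
# RMSE = sqrt(mean((y - yhat)^2))
y    <- c(24.0, 21.6, 34.7)   # observed medv (toy values)
yhat <- c(25.0, 20.0, 33.0)   # predicted medv (toy values)
sqrt(mean((y - yhat)^2))      # about 1.47
```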
4.7.2 Build Other Baseline Models (DRF, GBM, DNN & XGB)
# Baseline Distributed Random Forest (DRF)
model_drf <- h2o.randomForest(x = features,
y = target,
training_frame = h_train,
model_id = "baseline_drf",
nfolds = 5,
seed = n_seed)

# Baseline Gradient Boosting Model (GBM)
model_gbm <- h2o.gbm(x = features,
y = target,
training_frame = h_train,
model_id = "baseline_gbm",
nfolds = 5,
seed = n_seed)

# Baseline Deep Neural Network (DNN)
# By default, DNN is not reproducible with multi-core. You may get slightly different results here.
# You can enable the `reproducible` option but it will run on a single core (very slow).
model_dnn <- h2o.deeplearning(x = features,
y = target,
training_frame = h_train,
model_id = "baseline_dnn",
nfolds = 5,
seed = n_seed)

4.7.3 Comparison (RMSE: Lower = Better)
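The comparison table below also references a baseline XGBoost model (model_xgb) whose training code is not shown above; a sketch mirroring the other baseline calls:

```r
# Baseline eXtreme Gradient Boosting (XGB)
model_xgb <- h2o.xgboost(x = features,
                         y = target,
                         training_frame = h_train,
                         model_id = "baseline_xgb",
                         nfolds = 5,
                         seed = n_seed)
```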
# Create a table to compare RMSE from different models
d_eval <- data.frame(model = c("H2O GLM: Generalized Linear Model (Baseline)",
"H2O DRF: Distributed Random Forest (Baseline)",
"H2O GBM: Gradient Boosting Model (Baseline)",
"H2O DNN: Deep Neural Network (Baseline)",
"XGBoost: eXtreme Gradient Boosting Model (Baseline)"),
stringsAsFactors = FALSE)
d_eval$RMSE_cv <- NA
d_eval$RMSE_test <- NA

# Store RMSE values
d_eval[1, ]$RMSE_cv <- model_glm@model$cross_validation_metrics@metrics$RMSE
d_eval[2, ]$RMSE_cv <- model_drf@model$cross_validation_metrics@metrics$RMSE
d_eval[3, ]$RMSE_cv <- model_gbm@model$cross_validation_metrics@metrics$RMSE
d_eval[4, ]$RMSE_cv <- model_dnn@model$cross_validation_metrics@metrics$RMSE
d_eval[5, ]$RMSE_cv <- model_xgb@model$cross_validation_metrics@metrics$RMSE
d_eval[1, ]$RMSE_test <- h2o.rmse(h2o.performance(model_glm, newdata = h_test))
d_eval[2, ]$RMSE_test <- h2o.rmse(h2o.performance(model_drf, newdata = h_test))
d_eval[3, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm, newdata = h_test))
d_eval[4, ]$RMSE_test <- h2o.rmse(h2o.performance(model_dnn, newdata = h_test))
d_eval[5, ]$RMSE_test <- h2o.rmse(h2o.performance(model_xgb, newdata = h_test))

4.8 Manual Tuning
4.8.1 Check out the hyper-parameters for each algo
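Each algorithm's help page documents its tunable hyper-parameters; one way to browse them:

```r
# Open the help pages to see the available hyper-parameters for each algo
?h2o.gbm
?h2o.xgboost
```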
4.8.2 Train a xgboost model with manual settings
model_xgb_m <- h2o.xgboost(x = features,
y = target,
training_frame = h_train,
model_id = "model_xgb_m",
nfolds = 5,
seed = n_seed,
# Manual Settings based on experience
learn_rate = 0.1, # use a lower rate (more conservative)
ntrees = 100, # use more trees (due to lower learn_rate)
sample_rate = 0.9, # use random n% of samples for each tree
col_sample_rate = 0.9) # use random n% of features for each tree

4.8.3 Comparison (RMSE: Lower = Better)
d_eval_tmp <- data.frame(model = "XGBoost: eXtreme Gradient Boosting Model (Manual Settings)",
RMSE_cv = model_xgb_m@model$cross_validation_metrics@metrics$RMSE,
RMSE_test = h2o.rmse(h2o.performance(model_xgb_m, newdata = h_test)))
d_eval <- rbind(d_eval, d_eval_tmp)
datatable(d_eval, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)

4.9 H2O AutoML
# Run AutoML (try n different models)
# Check out all options using ?h2o.automl
automl = h2o.automl(x = features,
y = target,
training_frame = h_train,
nfolds = 5, # 5-fold Cross-Validation
max_models = 20, # Max number of models
stopping_metric = "RMSE", # Metric to optimize
project_name = "automl_boston", # Specify a name so you can add more models later
seed = n_seed)

4.9.1 Leaderboard
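The leaderboard code itself is not shown; one way to inspect it, using the leaderboard slot of the AutoML object:

```r
# Models ranked by cross-validation performance (RMSE here)
as.data.frame(automl@leaderboard)
```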
4.9.2 Best Model (Leader)
Model Details:
==============
H2ORegressionModel: stackedensemble
Model ID: StackedEnsemble_BestOfFamily_AutoML_20191116_120246
NULL
H2ORegressionMetrics: stackedensemble
** Reported on training data. **
MSE: 0.4180159
RMSE: 0.6465415
MAE: 0.5062811
RMSLE: 0.03140634
Mean Residual Deviance : 0.4180159
H2ORegressionMetrics: stackedensemble
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 7.834856
RMSE: 2.799081
MAE: 1.874201
RMSLE: 0.1211643
Mean Residual Deviance : 7.834856
4.9.3 Comparison (RMSE: Lower = Better)
d_eval_tmp <- data.frame(model = "Best Model from H2O AutoML",
RMSE_cv = automl@leader@model$cross_validation_metrics@metrics$RMSE,
RMSE_test = h2o.rmse(h2o.performance(automl@leader, newdata = h_test)))
d_eval <- rbind(d_eval, d_eval_tmp)
datatable(d_eval, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 4)

4.10 Make Predictions
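The predictions below appear to come from the AutoML leader on the test set; a sketch of that call:

```r
# Predict medv for the test set with the best AutoML model
yhat_test <- h2o.predict(automl@leader, h_test)
head(yhat_test)
```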
predict
1 36.43380
2 17.12354
3 21.21855
4 18.26257
5 17.64718
6 17.80730
5 Regression Part Two: XAI
Let’s look at the first house in h_test
datatable(as.data.frame(h_test[1, ]),
          rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE))

5.1 Using functions in h2o
- h2o.varimp() & h2o.varimp_plot(): Variable Importance (for GBM, DNN, GLM)
- h2o.partialPlot(): Partial Dependence Plots
- h2o.predict_contributions(): SHAP values (for GBM and XGBoost only)
# Look at the impact of feature `rm` (no. of rooms)
# Not Run
h2o.partialPlot(model_glm, data = h_test, cols = c("rm"))
h2o.partialPlot(model_drf, data = h_test, cols = c("rm"))
h2o.partialPlot(model_gbm, data = h_test, cols = c("rm"))
h2o.partialPlot(model_dnn, data = h_test, cols = c("rm"))
h2o.partialPlot(model_xgb, data = h_test, cols = c("rm"))
h2o.partialPlot(automl@leader, data = h_test, cols = c("rm"))

5.2 Package DALEX
- Website: https://pbiecek.github.io/DALEX/
- Original DALEX-H2O Example: https://raw.githack.com/pbiecek/DALEX_docs/master/vignettes/DALEX_h2o.html
5.2.1 The explain() Function
The first step of using the DALEX package is to wrap the black-box model with meta-data that unifies model interfacing.
To create an explainer we use the explain() function. The validation dataset for the models is h_test from part one. For models created by the h2o package we have to provide a custom predict function which takes two arguments, model and newdata, and returns a numeric vector with predictions.
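The custom predict function itself is not shown in the text; a minimal version (following the pattern of the DALEX-H2O vignette linked above) could look like:

```r
# Wrap h2o.predict so DALEX gets a plain numeric vector back
custom_predict <- function(model, newdata) {
  newdata_h2o <- as.h2o(newdata)
  res <- as.data.frame(h2o.predict(model, newdata_h2o))
  as.numeric(res$predict)
}
```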
5.2.2 Explainer for H2O Models
explainer_drf <- DALEX::explain(model = model_drf,
data = as.data.frame(h_test)[, features],
y = as.data.frame(h_test)[, target],
predict_function = custom_predict,
label = "Random Forest")

Preparation of a new explainer is initiated
-> model label : Random Forest
-> data : 95 rows 13 cols
-> target variable : 95 values
-> predict function : custom_predict
-> predicted values : numerical, min = 9.674672 , mean = 23.89372 , max = 46.382
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = -11.4882 , mean = 0.3315435 , max = 9.9948
-> model_info : Model of class: H2ORegressionModel, package unrecognized, ver. unknown, task regression ( default )
A new explainer has been created!
explainer_dnn <- DALEX::explain(model = model_dnn,
data = as.data.frame(h_test)[, features],
y = as.data.frame(h_test)[, target],
predict_function = custom_predict,
label = "Deep Neural Networks")

Preparation of a new explainer is initiated
-> model label : Deep Neural Networks
-> data : 95 rows 13 cols
-> target variable : 95 values
-> predict function : custom_predict
-> predicted values : numerical, min = 11.47347 , mean = 25.9381 , max = 52.32421
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = -25.15236 , mean = -1.712836 , max = 8.32301
-> model_info : Model of class: H2ORegressionModel, package unrecognized, ver. unknown, task regression ( default )
A new explainer has been created!
explainer_xgb <- DALEX::explain(model = model_xgb,
data = as.data.frame(h_test)[, features],
y = as.data.frame(h_test)[, target],
predict_function = custom_predict,
label = "XGBoost")

Preparation of a new explainer is initiated
-> model label : XGBoost
-> data : 95 rows 13 cols
-> target variable : 95 values
-> predict function : custom_predict
-> predicted values : numerical, min = 8.771681 , mean = 24.31138 , max = 50.43563
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = -20.09026 , mean = -0.08612126 , max = 9.010389
-> model_info : Model of class: H2ORegressionModel, package unrecognized, ver. unknown, task regression ( default )
A new explainer has been created!
explainer_automl <- DALEX::explain(model = automl@leader,
data = as.data.frame(h_test)[, features],
y = as.data.frame(h_test)[, target],
predict_function = custom_predict,
label = "H2O AutoML")

Preparation of a new explainer is initiated
-> model label : H2O AutoML
-> data : 95 rows 13 cols
-> target variable : 95 values
-> predict function : custom_predict
-> predicted values : numerical, min = 8.667347 , mean = 24.35622 , max = 49.94388
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = -13.3916 , mean = -0.130959 , max = 9.976462
-> model_info : Model of class: H2ORegressionModel, package unrecognized, ver. unknown, task regression ( default )
A new explainer has been created!
5.2.3 Variable importance
Using the DALEX package we can better understand which variables are important.
Model-agnostic variable importance is calculated by means of permutations: we subtract the loss function calculated on the validation dataset from the loss function calculated on the validation dataset with permuted values for a single variable.
This method is implemented in the ingredients::feature_importance() function.
library(ingredients)
vi_drf <- feature_importance(explainer_drf, type="difference")
vi_dnn <- feature_importance(explainer_dnn, type="difference")
vi_xgb <- feature_importance(explainer_xgb, type="difference")
vi_automl <- feature_importance(explainer_automl, type="difference")

5.2.4 Partial Dependence Plots
Partial Dependence Plots (PDP) are one of the most popular methods for exploring the relation between a continuous variable and the model outcome. The partial_dependency() function from the ingredients package calculates the PDP response.
Let’s look at feature rm (no. of rooms)
pdp_drf_rm <- partial_dependency(explainer_drf, variables = "rm")
pdp_dnn_rm <- partial_dependency(explainer_dnn, variables = "rm")
pdp_xgb_rm <- partial_dependency(explainer_xgb, variables = "rm")
pdp_automl_rm <- partial_dependency(explainer_automl, variables = "rm")
plot(pdp_drf_rm, pdp_dnn_rm, pdp_xgb_rm, pdp_automl_rm)

5.2.5 Prediction Understanding
# Predictions from different models
yhat <- data.frame(model = c("H2O DRF: Distributed Random Forest (Baseline)",
"H2O DNN: Deep Neural Network (Baseline)",
"XGBoost: eXtreme Gradient Boosting Model (Baseline)",
"Best Model from H2O AutoML"))
yhat$prediction <- NA
yhat[1,]$prediction <- as.matrix(h2o.predict(model_drf, h_test[1,]))
yhat[2,]$prediction <- as.matrix(h2o.predict(model_dnn, h_test[1,]))
yhat[3,]$prediction <- as.matrix(h2o.predict(model_xgb, h_test[1,]))
yhat[4,]$prediction <- as.matrix(h2o.predict(automl@leader, h_test[1,]))
# Show the predictions
datatable(yhat, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE)) %>%
  formatRound(columns = -1, digits = 3)

The function break_down() is a wrapper around the iBreakDown package. The model prediction is visualized with Break Down Plots, which show the contribution of every variable present in the model. break_down() generates variable attributions for the selected prediction; the generic plot() function shows these attributions.
library(iBreakDown)
sample <- as.data.frame(h_test)[1, ] # Using the first sample from h_test
pb_drf <- break_down(explainer_drf, new_observation = sample)
pb_dnn <- break_down(explainer_dnn, new_observation = sample)
pb_xgb <- break_down(explainer_xgb, new_observation = sample)
pb_automl <- break_down(explainer_automl, new_observation = sample)
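As noted above, the attributions are then visualized with the generic plot() function, e.g.:

```r
# Break Down Plots for the selected house, one model at a time
plot(pb_drf)
plot(pb_automl)
```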